1. LOAD REQUIRED PACKAGES

2. LOAD AND PREPARE THE DATA

2A. Load the processed data

2B. Load the raw data

3. EXPLORATORY DATA ANALYSIS

3A. Find outliers

There are no outliers in the train data.
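The outline does not say how outliers were checked. One common option is the 1.5 x IQR rule, sketched below under that assumption. The helper name `count_iqr_outliers` and the toy columns `PTS` and `AST` are hypothetical and only serve the illustration.

```python
import pandas as pd

def count_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values per numeric column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Hypothetical toy data: "PTS" holds one extreme value, "AST" holds none.
toy = pd.DataFrame({
    "PTS": [10, 11, 12, 13, 14, 100],
    "AST": [1, 2, 3, 4, 5, 6],
})
outlier_counts = count_iqr_outliers(toy)
```

A count of zero in every column would match the finding stated above, though the threshold `k` is a judgment call.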

3B. Find the difference between classes in each feature

We used the features' median values to compare the differences between the two types of players.
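A minimal sketch of such a median comparison, assuming a binary target coded 0/1. The column name `TARGET`, the feature names, and the helper `median_gap_by_class` are hypothetical.

```python
import pandas as pd

def median_gap_by_class(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Per-feature medians for class 0 and class 1, plus their absolute gap."""
    medians = df.groupby(target).median(numeric_only=True)
    table = pd.DataFrame({
        "median_0": medians.loc[0],
        "median_1": medians.loc[1],
    })
    table["abs_diff"] = (table["median_1"] - table["median_0"]).abs()
    # Largest gaps first, so the most class-separating features lead the table.
    return table.sort_values("abs_diff", ascending=False)

# Hypothetical toy data with three players per class.
toy = pd.DataFrame({
    "TARGET": [0, 0, 0, 1, 1, 1],
    "GP": [40, 50, 60, 70, 80, 90],
    "FG%": [44.0, 45.0, 46.0, 45.0, 46.0, 47.0],
})
gaps = median_gap_by_class(toy, "TARGET")
```

Raw gaps depend on each feature's units, so features on very different scales are not directly comparable without some normalisation.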

Express the differences as variances

3C. Visualise the range of values by class in each feature

3D. Visualise the relationship between pairs of features

3E. Visualise the correlation between pairs of features

3F. Detect multicollinearity between features
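The outline does not name the multicollinearity check that was used. A common choice is the variance inflation factor (VIF), which can be read off the diagonal of the inverse correlation matrix. The sketch below assumes that approach and uses synthetic data; the helper name is hypothetical.

```python
import numpy as np

def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic data: c is almost a + b, while d is unrelated to the others.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + b + rng.normal(scale=0.05, size=200)
d = rng.normal(size=200)
X = np.column_stack([a, b, c, d])

vif = variance_inflation_factors(X)
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity; here the first three columns exceed that by a wide margin, while the independent column stays close to 1.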

4. APPROACH 1: USE PRINCIPAL COMPONENT ANALYSIS (PCA)

4A. Find the optimum number of components

The scree plot shows that 10 components explain 98% of the variance in the data.
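The number of components needed for a given share of variance can also be read from the cumulative explained-variance ratio. The sketch below uses scikit-learn's `PCA` and assumes the features are standardised first, which the outline does not state. The data is synthetic and the helper name is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def components_for_variance(X: np.ndarray, target: float = 0.98):
    """Smallest number of components whose cumulative variance ratio reaches target."""
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA().fit(X_scaled)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # First index at which the cumulative ratio is >= target, as a 1-based count.
    n = int(np.searchsorted(cumulative, target)) + 1
    return min(n, len(cumulative)), cumulative

# Synthetic data: 8 observed features driven by 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 8))
X = latent @ mixing + rng.normal(scale=0.05, size=(200, 8))

n_components, cumulative = components_for_variance(X, target=0.98)
```

Plotting `cumulative` against the component index gives the cumulative counterpart of a scree plot.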

4B. Transform the train and test data into 10 PCA components
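A minimal sketch of this step, assuming standardisation before PCA and fitting on the train data only, so that no information from the test data leaks into the projection. The array shapes (19 features) are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the real train and test feature matrices.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 19))
X_test = rng.normal(size=(30, 19))

# Scaler and PCA are fitted on the train data, then reused unchanged on the test data.
reducer = make_pipeline(StandardScaler(), PCA(n_components=10))
X_train_pca = reducer.fit_transform(X_train)
X_test_pca = reducer.transform(X_test)
```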

4C. Evaluate PCA performance

4D. Tune Logistic Regression (LG) with the PCA-transformed train data
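The outline does not list the hyperparameters that were tuned. The sketch below assumes a grid search over the regularisation strength `C` and `class_weight`, scored by cross-validated AUC. The grid values and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the PCA-transformed train data: two informative columns.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 5))
X[:, :2] += y[:, None]  # shift the first two columns for class 1

# Hypothetical grid; the values actually searched are not given in the outline.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "class_weight": [None, "balanced"]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

best_params = search.best_params_
best_auc = search.best_score_
```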

4E. Train the tuned LG with the data oversampled by SMOTE and save the prediction result
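SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The outline does not say which implementation was used; the `SMOTE` class in imbalanced-learn is a common choice. The NumPy-only function below is a simplified sketch of the idea for a binary target, not that library's code, and its name and parameters are hypothetical.

```python
import numpy as np

def smote_oversample(X, y, minority_label, k=5, seed=0):
    """Balance a binary dataset by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    n_needed = int((y != minority_label).sum()) - len(X_min)
    if n_needed <= 0 or len(X_min) < 2:
        return X, y
    # Pairwise distances within the minority class; exclude self-matches.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Pick a random base point and one of its k neighbours for each new sample.
    base = rng.integers(0, len(X_min), size=n_needed)
    chosen = neighbours[base, rng.integers(0, k, size=n_needed)]
    gap = rng.random((n_needed, 1))
    synthetic = X_min[base] + gap * (X_min[chosen] - X_min[base])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_out, y_out

# Synthetic imbalanced data: 40 majority samples, 10 minority samples.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(40, 2)), rng.normal(loc=3.0, size=(10, 2))])
y = np.array([0] * 40 + [1] * 10)
X_res, y_res = smote_oversample(X, y, minority_label=1)
```

Oversampling is normally applied to the train data only, so that evaluation data keeps the original class balance.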

5. APPROACH 2: USE DOMAIN KNOWLEDGE

5A. Method 1: based on CARMELO NBA Player Projection

Engineer new features

Evaluate the LG trained with the selected features

We didn't pursue this option further because it had a lower cross-validation AUC score than the PCA option above.
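A comparison of this kind can be sketched by scoring each candidate feature set with the same model and the same cross-validation scheme. The helper `cv_auc`, the 5-fold setting, and the synthetic feature sets are assumptions; the outline does not give the actual scores or folds.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def cv_auc(X, y, cv=5):
    """Mean cross-validated ROC AUC of a default logistic regression."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(model, X, y, scoring="roc_auc", cv=cv).mean()

# Synthetic example: set A carries class signal, set B is pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
features_a = rng.normal(size=(400, 3)) + 1.5 * y[:, None]
features_b = rng.normal(size=(400, 3))

auc_a = cv_auc(features_a, y)
auc_b = cv_auc(features_b, y)
```

Keeping the model and folds fixed means any difference in AUC can be attributed to the features rather than to the evaluation setup.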

5B. Method 2: based on Hollinger's Player Efficiency Rating (PER)

Engineer the new feature PER

Evaluate the LG trained with the selected features

Tune an LG with the selected features

Train the tuned LG with the data oversampled by SMOTE and save the prediction result

6. APPROACH 3: USE STATISTICAL ANALYSIS

6A. Select important features to train a default LG and evaluate its performance

We selected features with large differences between the classes' median values (refer to 3B above).
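The outline does not define what counts as a "large" difference. The sketch below assumes one possible rule: rank features by the absolute median gap divided by the feature's standard deviation, keep the top `k`, and score a default logistic regression on them. The helper name, the value of `k`, the column names, and the synthetic data are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def top_features_by_median_gap(df: pd.DataFrame, target: str, k: int) -> list:
    """Names of the k features with the largest scaled median gap between classes."""
    features = df.drop(columns=[target])
    medians = features.groupby(df[target]).median()
    # Dividing by the standard deviation makes gaps comparable across units.
    gap = (medians.loc[1] - medians.loc[0]).abs() / features.std()
    return gap.sort_values(ascending=False).head(k).index.tolist()

# Synthetic data: "GP" and "MIN" differ by class, "BLK" does not.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
toy = pd.DataFrame({
    "GP": 50 + 20 * y + rng.normal(scale=10, size=300),
    "MIN": 15 + 8 * y + rng.normal(scale=5, size=300),
    "BLK": rng.normal(loc=0.5, scale=0.2, size=300),
    "TARGET": y,
})

selected = top_features_by_median_gap(toy, "TARGET", k=2)
auc = cross_val_score(
    LogisticRegression(max_iter=1000),
    toy[selected],
    toy["TARGET"],
    scoring="roc_auc",
    cv=5,
).mean()
```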

6B. Tune an LG with selected features

6C. Train the tuned LG with data oversampled by SMOTE and save the prediction result